Integration test to check data re-ingest when switching runtimes #10544
base: main
Conversation
This pull request does not have a backport label. Could you fix it @leehinman? 🙏
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
// It is possible to have a batch in flight not get acked
// before elastic-agent shuts down. That can lead to
Is there no way to wait (for a reasonable amount of time, maybe a few seconds) for that last batch to be indexed before restarting the Agent in the test? That would make the test more deterministic IMO.
The test is checking whether the "last" log message from startup has been indexed; that is part of what the healthcheck function is doing, and it does make the test much more consistent. It is also why the test disables monitoring and then restarts before enabling it with otel mode. But I haven't found a way to make this totally deterministic.
There is always a window where elastic-agent produces a log, elastic-agent or agentbeat "sees" the log and tries to send it, and then we kill elastic-agent. In that scenario the registry on disk will always be behind what Elasticsearch has indexed, and we will produce a duplicate on the next restart. But this should be a small number of duplicates, not the "entire file", which is what we are trying to make sure isn't happening.
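For illustration, a minimal sketch of that wait, assuming a hypothetical countDocsWithMessage helper and index pattern rather than the test's actual healthcheck function:
// Hedged sketch: block until the final startup log line is visible in the
// monitoring index before allowing the restart to proceed.
// countDocsWithMessage and the index pattern are illustrative placeholders.
require.Eventually(t, func() bool {
	n, err := countDocsWithMessage(ctx, esClient, "logs-elastic_agent-*", lastStartupMsg)
	return err == nil && n > 0
}, 2*time.Minute, 5*time.Second, "last startup log line was never indexed")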
We don't kill elastic-agent though, we restart it in an orderly way. In principle, there shouldn't be any duplicates in this case, right?
In principle, there shouldn't be any duplicates in this case, right?
I started out with this assumption too. But we don't have an "only once" delivery guarantee, we have an "at least once" delivery guarantee. Because of this, having occasional duplicates is "normal behavior", but re-ingesting the whole file would not be, which is why I switched to accepting a "small percentage" of duplicates. Even if we shut down "in an orderly way", there are still timeouts that can cause us to exit without receiving a reply from Elasticsearch even though the bulk request eventually succeeds. And those timeouts are necessary because we don't want network issues and the like to prevent shutting down.
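As a rough sketch, tolerating a bounded duplicate ratio could look like the following; countIngestedLines, countUniqueLines, and the 5% threshold are illustrative placeholders, not the values used in the test:
// Hedged sketch: with at-least-once delivery a few duplicates are expected,
// but re-ingesting the whole file would roughly double the document count.
total, err := countIngestedLines(ctx, esClient, "logs-elastic_agent-*")
require.NoError(t, err)
unique, err := countUniqueLines(ctx, esClient, "logs-elastic_agent-*")
require.NoError(t, err)
dupRatio := float64(total-unique) / float64(unique)
require.LessOrEqualf(t, dupRatio, 0.05, "too many duplicate log lines: %d total vs %d unique", total, unique)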
Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)
LGTM overall, comments are non-blocking. I am somewhat concerned that we're tolerating too many duplicates, though.
policy.Agent.Monitoring["enabled"] = false
updatedPolicyBytes, err = yaml.Marshal(policy)
require.NoErrorf(t, err, "error marshalling policy, struct was %v", policy)
err = fut.Configure(ctx, updatedPolicyBytes)
require.NoError(t, err)

// restart to make sure beat processes have exited
restartBytes, err = fut.Exec(ctx, []string{"restart"})
require.NoErrorf(t, err, "Restart error: %s, output was: %s", err, string(restartBytes))
This is probably fine, but the more idiomatic way would be to update the policy in Kibana and wait until agent reports it as applied. Any specific reason to reload the configuration locally instead?
That's a good idea. Do we have an example of using the API to change the runtime?
@swiatekm updated to use the API to make changes, plus enrolled in Fleet.
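For reference, a hedged sketch of what a policy update through the Kibana Fleet API (PUT /api/fleet/agent_policies/{id}) can look like; the policy ID, credentials, body fields, and the exact knob that switches the monitoring runtime are placeholders rather than the code in this PR:
// Hedged sketch of updating an agent policy through the Kibana Fleet API.
body := []byte(`{"name":"test-policy","namespace":"default","monitoring_enabled":["logs","metrics"]}`)
req, err := http.NewRequestWithContext(ctx, http.MethodPut,
	kibanaURL+"/api/fleet/agent_policies/"+policyID, bytes.NewReader(body))
require.NoError(t, err)
req.Header.Set("Content-Type", "application/json")
req.Header.Set("kbn-xsrf", "true")
req.SetBasicAuth(kibanaUser, kibanaPass)
resp, err := http.DefaultClient.Do(req)
require.NoError(t, err)
defer resp.Body.Close()
require.Equalf(t, http.StatusOK, resp.StatusCode, "policy update via Fleet API failed")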
Putting back in draft until #10645 is fixed.
💔 Build Failed
cc @leehinman
What does this PR do?
Adds an integration test that checks whether switching to the otel runtime for
monitoring causes the elastic-agent log files to be re-ingested. It also
checks that restarting the agent with the otel runtime does not result in
re-ingesting the log files.
Why is it important?
Re-ingestion is wasteful, and if restarts caused state to be lost we could lose data.
Checklist
I have added an entry in ./changelog/fragments using the changelog tool
Disruptive User Impact
None. Integration test only.
How to test this PR locally
Related issues
Questions to ask yourself